1. Introduction


Even in the execution of known routine procedures, people make nonrandom errors.
Everyone has had this experience, whether it is leaving one’s bank card in an ATM or failing to
attach a promised file to an email message. While many such errors have little or no real cost,
others have dire consequences, including loss of human life (Casey, 1998). In many
situations faced by Navy personnel, the potential consequences tend toward the severe end of
this range (e.g., the 1988 Vincennes incident in the Persian Gulf). An understanding of the
cognitive and perceptual mechanisms underlying such errors, and therefore ultimately knowledge
about how to defeat them, would clearly be valuable to the Navy, other Department of Defense
organizations, and private industry.

However, this problem has spawned surprisingly little research. Senders and Moray
(1991, p. 2) identify what is probably the major explanation: “one reason for this is that error is
frequently considered only as result or measure of some other variable, and not a phenomenon in
its own right.” Empirical work on systematic errors in the execution of routine procedures is
dominated by anecdotal accounts (e.g., Casey, 1998); controlled experiments on this subject
are quite rare. The dominant theoretical paradigm in this area is certainly the one proposed by
Reason (1990), which is more a taxonomy than a theory. Reason classifies errors into two types:
“mistakes,” which are the result of forming an incorrect intention to act, and “slips,” which are
failures to correctly execute an intention. These are tied to Rasmussen’s (1987)
skill-rule-knowledge (SRK) hierarchy of skill acquisition. Mistakes are usually errors at the
knowledge level; that is, the person making the error has incorrect knowledge about how to
perform the task. While this is certainly the explanation for many errors, it does not appear to
apply to many interesting forms of systematic procedural error in which the person does know the
correct set of steps. Reason generally attributes errors at the skill or rule levels of
performance to “lapses of attention.” From the perspective of trying to understand the causes of
such errors, this is not helpful. First, it is at best postdictive, not predictive. Second, it
simply shifts the locus of the problem to another area of psychology which is, at best,
ill-defined.

This research represents an effort to improve this situation within a particular domain,
that of errors in the execution of routine procedures. Routine procedures are those which fall
under the heading of routine cognitive skill as defined by Card, Moran, and Newell (1983) and
John and Kieras (1996). Such a skill is one where the person executing the skill has the correct
knowledge of how to perform the task and simply needs to execute that knowledge. Roughly
speaking, that can be thought of as the point where people are no longer problem solving, but
rather applying proceduralized knowledge to a relatively familiar task.

This  level  of  skill  has  been  the  focus  of  attention  for  an  entire  family  (the  GOMS  family; 
John  &  Kieras,  1996)  of  techniques  for  analysis  and  execution  time  prediction.  This  is  largely 
due  to  the  fact  that  such  a  wide  array  of  situations  fall  under  this  classification,  from  occasional 
but  not  infrequent  programming  of  VCRs  to  situations  involving  highly-motivated  people  in 
safety-critical  situations,  such  as  commercial  pilots  and  medical  professionals.  As  noted,  GOMS, 
which  stands  for  goals,  operators,  methods,  and  selection  rules,  is  one  of  the  primary  techniques 
for  predicting  human  performance  under  these  conditions,  and  the  empirical  success  of  GOMS  is 
well-documented  (again,  see  John  &  Kieras,  1996).  A  typical  GOMS  analysis  is  based  on  a 
hierarchical goal decomposition and then a listing of the primitive operators needed to carry out
the lowest-level goals. Thus, GOMS analyses are highly sensitive to the goal-based task structure
and  the  number  of  primitive  operations  required. 

However,  GOMS-class  analyses  do  not  take  into  account  visual  factors  of  the  interface 
such  as  the  layout  of  the  controls  used  when  executing  a  procedure.  Furthermore,  the  model  of 
cognitive  control  underlying  the  GOMS  approach,  goal  stacks,  does  not  appear  to  be  adequate  to 
explain  well-known  error  types,  particularly  postcompletion  errors.  (Postcompletion  errors  are 
errors  in  which  the  operator  omits  a  step  or  subgoal  of  the  procedure  where  that  step  or  subgoal 
occurs  after  the  main  goal  of  the  task  is  completed.  Examples  include  leaving  a  bankcard  in  an 
automated teller machine or driving off without the gas cap after filling the tank.) While GOMS
analyses  can  identify  where  postcompletion  errors  might  occur,  they  do  not  explain  why  some 
cue-based  mitigation  strategies  are  effective  and  why  others  are  not. 

Issues of cognitive control, and in particular cognitive control of vision, are likely to
become increasingly important as more interfaces become visual and visual interfaces
are deployed more widely (e.g., in cell phones, PDAs, and in-car navigation systems). Few
researchers work at the boundary between vision and cognition, and even fewer are
also concerned with how such performance affects the errors people make.

Since  GOMS  is  the  dominant  tool  for  understanding  human  performance  in  such  tasks, 
the  fact  that  it  cannot  accommodate  these  results  is  significant  for  anyone  who  wants  to  predict  or 
explain  how  people  execute  routine  procedures.  Because  errors  (and  sometimes  even  simple 
slowdowns)  in  the  execution  of  routine  procedures  can  have  such  a  high  cost,  it  is  important  to 
have  not  only  a  thorough  understanding  of  such  effects,  but  ultimately  to  have  a  model  which 
predicts  how  people  will  perform  when  executing  routine  procedures. 

2. Common Experimental Methods

Multiple  experiments  were  completed  in  this  funding  period.  All  of  these  experiments 
have  as  their  basis  a  common  set  of  experimental  tasks  and  methods,  so  these  common  methods 
will  be  described  in  some  detail  so  the  reader  may  become  familiar  with  them.  These  methods 
are  derived  from  the  methods  used  in  Byrne  and  Bovair  (1997). 

There were two primary tasks involved, set in a fictional “Star Trek” setting to engage
experimental participants. (Rice University, where the experiments were conducted, has a strong
science and engineering presence, and thus this was indeed an engaging cover story.) Both tasks
were, as described, routine procedures which the participants simply had to memorize. Each
procedure was broken into subgoals on which participants were explicitly instructed. Recall that
GOMS analysis represents the state of the art in task analysis for routine procedural
tasks. What such an analysis predicts is that two tasks with the same goal/method/operator
structure should produce identical performance. These two tasks had the same basic
goal/method/operator structure and are thus termed “GOMS-isomorphic.”

The subgoals and steps in each task are listed in Table 1, and the displays for each task are
presented in Figure 1. Participants were trained to a performance criterion (four error-free trials)
on each task in the first experimental session and then returned approximately one week later for
a second session. During the second session, the experiment program emitted warning beeps on
error commission. A concurrent working memory letter task was also introduced on the day of
testing. As in the study by Byrne and Bovair (1997), its function was to increase working
memory load during task performance. Participants were presented with auditory stimuli in the
form of randomly ordered letters spoken through headphones at a rate of one letter every
three seconds. A tone was presented at random intervals ranging from nine to forty-five
seconds, upon which the participants were directed to recall the last three letters in order and
type them into a text box that appeared on the screen.
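
The timing of the concurrent letter task described above can be sketched as follows. This is a
minimal illustration and not the original experiment program: the function names, the uniform
distribution of tone intervals, and the use of the full alphabet are assumptions. The text
specifies only one letter every three seconds, tones at random intervals of nine to forty-five
seconds, and recall of the last three letters in order.

```python
import random
import string

LETTER_PERIOD_S = 3.0                 # one spoken letter every three seconds (from the text)
TONE_MIN_S, TONE_MAX_S = 9.0, 45.0    # range of tone intervals (from the text)


def make_schedule(duration_s, rng):
    """Return (letters, tones): letters as (onset_s, letter) pairs, tones as onset times.

    Assumption: tone intervals are drawn uniformly from the 9-45 s range.
    """
    letters = []
    t = 0.0
    while t < duration_s:
        letters.append((t, rng.choice(string.ascii_uppercase)))
        t += LETTER_PERIOD_S
    tones = []
    t = rng.uniform(TONE_MIN_S, TONE_MAX_S)
    while t < duration_s:
        tones.append(t)
        t += rng.uniform(TONE_MIN_S, TONE_MAX_S)
    return letters, tones


def expected_recall(letters, tone_time):
    """Last three letters presented at or before the tone, in presentation order."""
    heard = [letter for onset, letter in letters if onset <= tone_time]
    return heard[-3:]


letters, tones = make_schedule(120.0, random.Random(0))
for tone in tones:
    answer = "".join(expected_recall(letters, tone))
    print(f"tone at {tone:5.1f} s -> correct answer: {answer}")
```

Because the shortest tone interval is nine seconds, at least four letters have always been
presented before any probe, so a three-letter answer is always defined.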

Participants were encouraged to work both accurately and quickly by means of a scoring
system, an onscreen timer, and prizes. The scoring system incremented points for each correctly
executed step and decremented points for each incorrect step. Bonus points were awarded for task
completion within a set time. The exact scoring scheme used varied slightly from experiment to
experiment. The number of trials of each task completed in the second session also varied
slightly from experiment to experiment; most were in the range of 12-14 trials per task.
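
The general shape of such a scoring system can be sketched as below. All point values and the
time limit are hypothetical placeholders chosen for illustration; the report states only that
points were added for correct steps, subtracted for incorrect steps, and that a bonus was given
for completion within a set time, with details varying by experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringScheme:
    # Hypothetical values: the actual scheme varied from experiment to experiment.
    points_per_correct_step: int = 10
    penalty_per_incorrect_step: int = 10
    time_bonus: int = 50
    bonus_time_limit_s: float = 60.0


def score_trial(step_correct, completion_time_s, scheme=ScoringScheme()):
    """Score one trial from per-step correctness flags and the total trial time."""
    score = 0
    for correct in step_correct:
        if correct:
            score += scheme.points_per_correct_step
        else:
            score -= scheme.penalty_per_incorrect_step
    # The bonus rewards finishing the whole task within the time limit.
    if completion_time_s <= scheme.bonus_time_limit_s:
        score += scheme.time_bonus
    return score


# A 12-step trial with one incorrect step, finished inside the bonus window.
print(score_trial([True] * 11 + [False], completion_time_s=48.0))
```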


Table 1. Subgoals and steps for the Phaser and Transporter tasks used in the experiments.

Step #    Phaser               Transporter

First subgoal
1         Power Connected      Scanner On
2         Charge               Active Scan
3         Stop Charging        Lock Signal
4         Power Connected      Scanner Off

Second subgoal
5         Settings             Enter Frequency
6         <slider>             <type>
7         Focus Set            Accept Frequency

Third subgoal
8         Firing               Transporter Power
9         Tracking             Synchronous Mode
10        <track-and-space>    <track-and-click>
11        Tracking             Synchronous Mode
12        Main Control         Main Control
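
The sense in which the two procedures in Table 1 share a structure can be illustrated with a
short sketch. This is a deliberately coarse stand-in for a GOMS analysis, which would also
compare methods and operator types: here two tasks are treated as isomorphic when they have the
same number of subgoals and the same number of steps under each subgoal. The function names are
illustrative.

```python
# Steps grouped by subgoal, transcribed from Table 1.
PHASER = [
    ["Power Connected", "Charge", "Stop Charging", "Power Connected"],
    ["Settings", "<slider>", "Focus Set"],
    ["Firing", "Tracking", "<track-and-space>", "Tracking", "Main Control"],
]
TRANSPORTER = [
    ["Scanner On", "Active Scan", "Lock Signal", "Scanner Off"],
    ["Enter Frequency", "<type>", "Accept Frequency"],
    ["Transporter Power", "Synchronous Mode", "<track-and-click>",
     "Synchronous Mode", "Main Control"],
]


def structure(task):
    """Reduce a task to its goal structure: the number of steps under each subgoal."""
    return [len(subgoal) for subgoal in task]


def is_isomorphic(task_a, task_b):
    """Simplified check: same subgoal count and same step count per subgoal."""
    return structure(task_a) == structure(task_b)


print(structure(PHASER), structure(TRANSPORTER), is_isomorphic(PHASER, TRANSPORTER))
```

Note that this comparison ignores everything about the display, which is the point of the
experiments: control names and layout differ between the tasks while the goal structure does not.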

[Figure 1a image not reproduced: Phaser task display. Labels on the display include Power,
Output, Phaser Focus Index, Status, Elapsed Time, Charge, Battery, Settings, Firing, Power
Connected, and Focus Set.]

[Figure 1b image not reproduced. Labels on the display include Status, Elapsed Time, Transporter
Power, Accept Frequency, Scanner On, Scanner Off, Active Scan, Lock Signal, and Main Control.]

Figure 1b. Task display for the “transporter” task.


There were two primary measures of performance: error frequency and step completion
time. It was possible to have several opportunities within a single trial to commit an error at each
action step. Error frequency is defined as the number of errors at step Xi divided by the number
of opportunities for error at step Xi. Each step can be considered a sequential choice (Ohlsson,
1996), so the definition was based on the step, not the action. That is, each step was counted as
either containing an error or not containing an error, regardless of the number of incorrect actions
taken. For example, if a participant was at step 3 of the procedure and clicked one incorrect
button, an error was made at step 3. Further incorrect clicks made there do not
advance the state of the procedure, so they were not counted as additional errors. This focuses
the analysis on where in the procedure the errors occurred. Step completion time was measured
as the time elapsed (in milliseconds) between the successful execution of the previous step and
the successful execution of the current step, that is, an inter-click latency. Steps containing errors
were omitted from this analysis.
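
The step-based error frequency defined above can be sketched as follows. The log format and
function name are assumptions made for illustration: each action is recorded as a
(visit_id, step, is_correct) tuple, where a visit is one arrival at a step and therefore one
opportunity for error. A visit counts as a single error if it contains any incorrect action.

```python
from collections import defaultdict


def error_frequency(actions):
    """Per-step error frequency from (visit_id, step, is_correct) action records.

    Frequency at a step = visits containing at least one incorrect action
    divided by the total number of visits (opportunities) at that step.
    """
    visit_had_error = {}
    for visit_id, step, is_correct in actions:
        key = (visit_id, step)
        visit_had_error[key] = visit_had_error.get(key, False) or not is_correct

    opportunities = defaultdict(int)
    errors = defaultdict(int)
    for (_, step), had_error in visit_had_error.items():
        opportunities[step] += 1
        errors[step] += int(had_error)
    return {step: errors[step] / opportunities[step] for step in opportunities}


log = [
    ("trial-1", 3, False),  # first incorrect click at step 3
    ("trial-1", 3, False),  # a further incorrect click: not an additional error
    ("trial-1", 3, True),   # correct click advances the procedure
    ("trial-1", 4, True),
    ("trial-2", 3, True),
    ("trial-2", 4, True),
]
print(error_frequency(log))
```

In this example step 3 has two opportunities and one of them contains an error, so the two
incorrect clicks in the first trial contribute a single error rather than two.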

3.  Major  Findings 

3.1  Empirical  Results:  Layout  Error 

The first important question was whether or not two tasks which are GOMS-isomorphic
could generate markedly different performance in terms of both execution time and error rates.
Figure 2, which is based on the aggregate from three different experiments, shows how two
GOMS-isomorphic tasks can differ in error frequency, and Figure 3 shows the differences in step
completion time. (Note that not all steps are included in the step completion time analysis
because some of the times are not simple inter-click latencies but include other activities such as
waiting for the interface or performing a tracking task.)

These results show conclusively that the two tasks did indeed have different
performance profiles, though they did not allow a clear assessment of why the two tasks
differed. Further research, however, was able to provide insights into this difference.




Figure 2. Error frequency as a function of step number in two GOMS-isomorphic tasks (Phaser
and Transporter). Error bars depict 95% confidence intervals based on 164 subjects.

Figure 3. Step completion time as a function of step number in two GOMS-isomorphic tasks
(Phaser and Transporter). Error bars depict 95% confidence intervals based on 164 subjects.


Human  performance  on  these  tasks  is  highly  sensitive  to  the  layout  of  the  controls  (e.g., 
buttons,  sliders)  on  the  display.  The  initial  results  were  based  on  two  tasks  isomorphic  in  GOMS 
structure  but  differing  in  control  layout  (and  cover  story).  Subsequent  experiments  explicitly 
manipulated  control  layouts  and  yielded  the  following  discoveries: 

First,  visual  grouping  makes  a  substantial  difference  in  performance.  Grouping  controls 
according  to  the  organization  of  subtasks  clearly  yields  superior  performance  to  grouping  based 
on  control  type  (i.e.,  all  radio  buttons  in  one  group,  all  pushbuttons  in  another,  etc.).  See  results 
presented  in  Figures  4  and  5,  which  clearly  show  that  error  rate  and  task  execution  rate  are 
affected  by  how  the  controls  on  the  display  are  visually  grouped.  Note  that  in  a  replication  study 
of the “Task” grouping, the error rate for step 6 was substantially reduced, providing further
evidence  that  visual  grouping  interacts  with  task  structure. 


Figure 4. Error frequency for different control layouts. Control = original layout; Task = controls
grouped by subtask; Type = controls grouped by control type.

Figure 5. Step completion time for different control layouts. Control = original layout; Task =
controls grouped by subtask; Type = controls grouped by control type.

Second, users are sensitive to both global and local constraints on where controls
“should” be. That is, they expect consistency with global considerations like reading order.
However, they are also sensitive to local considerations like how controls have been organized in
other parts of the display. Violating either constraint can lead users to err. See the results
presented in Figure 6, which depicts the error rate at a particular step in one of the procedures
(step 8 in the Transporter; see Figure 2 for how this compares to other steps). In this
experiment, two of the conditions were inconsistent with expectation, violating either local or
global consistency. In the two other conditions, this step was consistent with global expectations
and with the local ordering of either the prior subgoal (“subgoal consistent” condition) or all
prior subgoals (“full consistent” condition). Clearly, both constraints must be honored in order
to reduce error rates to subsystematic frequency (i.e., less than 5%).


Figure  6.  Error  rate  at  step  8  in  the  Transporter  task  by  condition.  See  text  for  further 
explanation.  Error  bars  represent  one  standard  error  of  the  mean. 

Third, performance is surprisingly insensitive to the surface features of the controls
themselves or the number of extraneous controls. Users apparently do not use the local state of
the controls (e.g., checked state of checkboxes) to track task progress in such routine tasks.
Similarly, adding many extraneous controls has little impact on performance. See the results
presented in Figure 7, which shows step completion time for the control version of the Phaser
task along with two variants, the “extra buttons” variant and the “push buttons” variant. In the
“extra buttons” condition, numerous extraneous buttons were added to the display. In the “push
buttons” condition, all the buttons were converted into push buttons rather than button types
which display state information (e.g., radio buttons or checkboxes). Similar results were obtained
for the Transporter task. It should be noted that GOMS models have nothing to say about any
effects of additional buttons or a lack of state information on the interface.

Figure 7. Step completion time by button type condition. See body text for full explanation.


3.2  Computational  Cognitive  Modeling 

Initial models of these tasks were constructed using the ACT-R cognitive architecture
(Anderson et al., 2004). These models are described in more detail in Byrne, Maurier, Fick, and
Chung (2004), which is presented in Appendix A, so the detail presented here will be limited.
While the modeling efforts have not proceeded as rapidly as hoped, due primarily to the
preponderance of unforeseen empirical results, the modeling work has nonetheless
generated important insights. First, it should be noted that the models have achieved “ballpark”
accuracy at reproducing the human execution time data. However, the models do not yet err with
human-like frequency. The most important result achieved to date with the modeling work
concerns the relative importance of cognitive control structures (e.g., goal management) vs.
visual search control. GOMS-style accounts have traditionally emphasized the former, but our
modeling work clearly demonstrates that the latter can be at least as important for modeling
routine procedures, the intended domain for GOMS models.

The  other  important  insight  which  came  from  the  modeling  work  was  the  inspiration  for 
the  empirical  investigation  into  surface  features  of  the  buttons  described  in  the  previous  section. 
This  came  directly  from  examination  of  the  model,  which  itself  does  not  make  use  of  state 
information  and  is  only  slightly  affected  by  extraneous  controls. 

3.3  Empirical  results:  Postcompletion  Error 

One important class of error which can occur in routine procedural tasks is the
postcompletion error. These are omissions of some step or subgoal of the procedure which has
the property that it must be executed after the main goal for the task has been satisfied. Standard
examples include leaving one’s bankcard in an automated teller machine or leaving the original
document on the glass of a photocopier or flatbed scanner. While multiple authors had
commented on postcompletion errors (e.g., Polson et al., 1992; Young et al., 1989), the first
laboratory demonstration of this error was Byrne and Bovair (1997). Byrne and Bovair suggested
that this error is so robust under conditions of high working memory load that the only viable
solution is to design it out. However, Altmann and Trafton (2002) suggested that these errors
could be mitigated with appropriate cueing.

Thus, a number of cueing manipulations were investigated and an effective cue was
indeed found: a just-in-time, highly visually salient, highly specific cue (red and yellow blinking
arrows pointing at the control to be acted upon). This cue did not merely reduce the incidence of
postcompletion error; it entirely eliminated the error. See Figure 8 for results. (This work is
described in great detail in Chung and Byrne, 2004, which appears in Appendix B.)

Since another highly salient visual cue (a mode indicator) failed to act as a mitigator,
further experiments examined the relevant properties of the cue. Weakening the cue by
presenting it prior to the appropriate time rendered the cue ineffective. Using a cue which was
less specific did mitigate the error somewhat, but not to the same degree. Reducing cue salience
(constant red, no blinking), but retaining specificity and appropriate timing, yielded a highly
effective cue. See Figure 9 for one of these results.

Figure 8. Postcompletion error frequency by mitigation condition. Cue = just-in-time blinking red
arrows; Mode = highly salient mode indicator. Error bars indicate one standard error of the mean.

Figure 9. Postcompletion error frequency vs. cue type. Dot = just-in-time blinking dot (salient,
nonspecific); Arrows = just-in-time non-blinking arrows (less salient, specific). Error bars indicate
one standard error of the mean.


4.  Other  Considerations 

4.1  Theoretical  Perspectives 

Finally, another activity undertaken during the period of support was further consideration of
how computational cognitive models can be used both to drive basic research and to inform
real-world applications. This has included work on integrative approaches, surveys of the
modeling literature, and position pieces on the role of modeling.

4.2 Technology Transfer

In  the  near  term,  technology  transfer  for  the  funded  work  under  this  grant  is  happening 
primarily  as  a  function  of  an  SBIR  given  by  ONR  to  D.  N.  American,  on  which  the  PI  for  this 
grant  has  been  hired  as  a  consultant.  Insights  gained  from  the  funded  research  can  be  brought  to 
bear  through  this  conduit  more  rapidly  than  through  traditional  journal  publication  channels. 
This transfer will include both computational modeling methodology and insights into
modeling human error. Additionally, data collected in the ONR-funded effort may be shared
with D. N. American to help accelerate their Navy-oriented SBIR work.

In  the  longer  term,  technology  transfer  should  happen  through  multiple  channels.  One  is 
the  publication  and  presentation  of  empirical  and  modeling  results  in  conferences  and  journals, 
making  them  widely  available.  This  has  obviously  begun,  but  further  publications  are  in 
progress.  Working  with  the  ACT-R  architecture  enables  another  more  subtle  form  of  technology 
transfer.  Because  the  PI  is  one  of  the  system  architects,  any  enhancements  made  to  the 
architecture  as  a  result  of  this  research  will  be  propagated  to  a  larger  community  of  researchers, 
namely  the  ACT-R  modeling  community,  which  includes  researchers  at  various  Navy  sites  as 
well  as  others  in  the  DoD  community. 

Abstract

A  major  domain  of  inquiry  in  human-computer  interaction 
is  the  execution  of  routine  procedures.  We  have  collected 
extensive  data  on  human  execution  of  two  procedures 
which  are  structurally  isomorphic,  but  not  visually 
isomorphic. Extant control approaches (e.g., GOMS)
predict they should have the same execution time and
error rate profiles, which they do not. We present a series
of  ACT-R  models  which  demonstrate  that  control  of  visual 
search  is  likely  a  key  component  in  modeling  similar 
domains. 

Introduction 

Every  day,  people  execute  countless  procedures  which  are 
more  or  less  routine.  Many  of  these  are  uninteresting,  but 
many  of  these  occur  in  contexts  such  as  emergency  rooms 
and  command-and-control  centers  where  failures  of  speed  or 
correctness  can  have  serious  consequences.  Thus, 
understanding  how  humans  execute  routine  procedures  is 
critical  in  at  least  some  domains.  Card,  Moran,  and  Newell 
(1983)  and  John  and  Kieras  (1996)  define  a  routine 
cognitive  skill  as  one  where  the  person  executing  the  skill 
has  the  correct  knowledge  of  how  to  perform  the  task  and 
simply  needs  to  execute  that  knowledge.  Roughly  speaking, 
that  can  be  thought  of  as  the  point  where  people  are  no 
longer  problem  solving,  but  rather  applying  proceduralized 
knowledge  to  a  relatively  familiar  task. 

This  level  of  skill  has  been  the  focus  of  attention  for  an 
entire  family  (the  GOMS  family;  John  &  Kieras,  1996)  of 
techniques  for  analysis  and  execution  time  prediction.  This 
is  largely  due  to  the  fact  that  such  a  wide  array  of  situations 
fall  under  this  classification,  from  occasional  but  not 
infrequent  programming  of  VCRs  to  situations  involving 
highly-motivated  people  in  safety-critical  situations,  such  as 
commercial  pilots  and  medical  professionals.  As  noted, 
GOMS,  which  stands  for  goals,  operators,  methods,  and 
selection  rules,  is  one  of  the  primary  techniques  for 
predicting  human  performance  under  these  conditions,  and 
the  empirical  success  of  GOMS  is  well-documented  (again, 
see  John  &  Kieras,  1996).  A  typical  GOMS  analysis  is 
based  on  a  hierarchical  goal  decomposition  and  then  a 
listing  of  the  primitive  operators  needed  to  carry  out  the 
lowest-level  goals.  Thus,  GOMS  analyses  are  highly 
sensitive  to  the  goal-based  task  structure  and  the  number  of 
primitive  operations  required. 

What  such  an  analysis  predicts  is  that  two  tasks  with  the 
same  goal/method/operator  structure  should  produce 
identical performance. While this may be true in a great
many cases, it is not universally true. We will present data
from  two  tasks  which  would  yield  equivalent  GOMS 
models  (which  we  term  “GOMS-isomorphic”)  but  produce 
significantly  different  profiles  in  terms  of  the  time  of 
execution  for  each  step,  as  well  as  the  error  rates  at  each 
step.  This  is  not  intended  as  a  criticism  of  the  GOMS 
modeling  approach,  but  rather  as  the  identification  of  an 
opportunity  for  improvement. 

This presentation will focus on performance in a series of
laboratory experiments in which participants were trained on
a number of relatively simple computer-based tasks and then,
in a subsequent session, returned to perform those tasks
along with a concurrent memory-loading task. This
paradigm is essentially the same as that used in Byrne and
Bovair (1997), which focused on a particular type of
procedural  error,  the  postcompletion  error.  This  line  of 
research  is  primarily  concerned  with  errors  made  in  the  task, 
but  to  fully  understand  the  errors  made,  we  felt  it  would 
first  be  necessary  to  understand  the  cognitive  control 
structures  which  would  produce  execution  times  similar  to 
those  we  found  in  the  lab.  In  order  to  understand  these 
experiments,  a  relatively  thorough  understanding  of  the 
tasks  is  required. 

The  Tasks 

Common  Procedures 

The  two  tasks  under  examination  were  both  set  in  a  fictional 
Star  Trek  setting  to  encourage  engagement  of  the 
undergraduate  participants.  Participants  came  in  for  two 
sessions  spaced  roughly  one  week  apart.  The  first  session 
was  training,  in  which  participants  were  given  a  description 
for  each  task  and  a  manual,  walked  through  the  task  once 
with  the  manual  in  hand,  and  then  had  to  repeat  each  task 
until  they  performed  it  without  error  three  times.  In  the 
second  session,  participants  performed  the  tasks  on  which 
they  were  trained  in  the  first  session,  along  with  a 
concurrent  memory-loading  task.  In  this  task,  they  had  to 
monitor  a  stream  of  spoken  letters  which  was  occasionally 
interrupted  with  a  beep,  after  which  they  responded  with  the 
last  three  letters  heard.  Participants  earned  points  for 
correctly  executed  steps,  lost  points  for  errors,  received 
bonus  points  for  rapid  performance,  and  lost  points  for 
incorrect  answers  to  the  memory  probes.  High  scorers 
received  additional  compensation. 

While  participants  were  trained  on  several  tasks,  not  all 
of  which  were  the  same  from  experiment  to  experiment,  the 
current  research  is  focused  on  two  tasks,  called  the  Phaser 
and the Transporter. These two tasks are isomorphic in that
they have the same number of steps which were grouped in
the training manuals in the same subgoals. The names of
those goals, and the names of the buttons and some of the
displays and actual controls, however, were different
between the two tasks.

The  displays  for  the  two  tasks  appear  in  Figures  1  and 
2  and  the  list  of  subgoals  and  steps  appears  in  Table  1.  The 
main  goal  in  the  Phaser  task  is  to  destroy  the  hostile 
Romulan  vessel;  the  main  goal  in  the  Transporter  task  is  to 
energize  it  to  return  some  crewmembers  to  safety.  One  of 
the  immediately  obvious  visible  differences  between  the  two 
layouts  is  that  the  controls  for  the  Transporter  are  visually 
grouped  according  to  subgoal  while  in  the  Phaser  they  are 
not. 


Figure 1. Phaser task display

Figure 2. Transporter task display

There are some other important features to note as well.
After Step 3 in both tasks, the participants had to wait until
the display reached an acceptable state before clicking the
next button. Step 6 in both tasks involved, or could
involve, multiple actions: multiple drag adjustments to the
slider in the case of the Phaser and multiple keystrokes in
the case of the Transporter. Step 10 in both procedures
involved a somewhat extended tracking task, done with the
arrow keys for the Phaser and with the mouse for the
Transporter. All of the other steps required the simple
clicking of a button.

The exact responses of the display and some of the task
structure did differ between the two tasks for Steps 11 and
12 as part of manipulations concerned with postcompletion
errors, so those steps were excluded from all present
analyses.

The major dependent variables of interest here were step
completion time and error frequency. Step completion time
was measured as the time between clicks. That is, the time for
Step 2 in the Phaser is the time elapsed between the click
on “Power Connected” and the click on “Charge.” For the
first step, the start time was the start of the trial. Steps on
which errors were made were excluded from the time
analysis. Times for steps that include other actions (waiting,
tracking) were excluded from the analysis because this
other time is difficult to factor out.

Error frequency was also measured. This hinges on the
definition of what counts as an error. Each step can be
considered a sequential choice (Ohlsson, 1996), so the
definition was based on the step, not the action. If any
incorrect action was taken at a step, that step counted as an
error, regardless of the number of incorrect actions taken.
For example, if a participant at Step 4 in the Phaser task
clicked on the “Settings” button and then the “Firing”
button, only one error was recorded because both incorrect
actions occurred at the same step. Frequency was calculated
as the number of error-containing steps divided by the total
number of steps executed.

Table 1. Steps in the two task isomorphs
(columns: Step #, Phaser, Transporter)


Because  these  tasks  are  essentially  isomorphic,  there  is  no  a 
priori  reason  to  necessarily  expect  different  performance  on 
the  two  tasks  (through  Step  10),  except  perhaps  slightly 
longer  step  completion  times  for  those  steps  where  the 
mouse  has  further  to  go.  Nor  was  assessing  such  differences 
the  original  purpose  of  the  three  experiments  we  will  report; 
those  experiments  were  primarily  focused  on 
postcompletion  errors  in  the  Phaser  task. 
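
As a concrete illustration of the two dependent measures defined
earlier in this section, the following is a minimal sketch, not the
authors' analysis code. The Click record, the function names, and the
log format are hypothetical; the default set of steps excluded from
timing (3, 6, and 10) follows the exclusions described for the step
completion time results.

```python
from dataclasses import dataclass


@dataclass
class Click:
    """One logged action (hypothetical log format)."""
    time: float    # seconds since the start of the trial
    step: int      # step the participant was on when the click occurred
    correct: bool  # True if this click completed the step


def step_measures(clicks, excluded_steps=frozenset({3, 6, 10})):
    """Return (times, error_steps) for one trial.

    times maps step -> completion time for steps that were error-free
    and not in excluded_steps. error_steps is the set of steps on which
    at least one incorrect action occurred.
    """
    times, error_steps = {}, set()
    prev_time = 0.0  # the first step is timed from the start of the trial
    for c in clicks:
        if not c.correct:
            error_steps.add(c.step)  # a step counts once, however many errors
            continue
        if c.step not in error_steps and c.step not in excluded_steps:
            times[c.step] = c.time - prev_time
        prev_time = c.time
    return times, error_steps


def error_frequency(error_steps, steps_executed):
    """Error-containing steps divided by total steps executed."""
    return len(error_steps) / steps_executed


trial = [
    Click(2.0, 1, True),
    Click(3.5, 2, True),
    Click(5.0, 3, True),
    Click(6.0, 4, False),  # two wrong clicks at step 4 ...
    Click(6.5, 4, False),  # ... still count as a single error
    Click(7.5, 4, True),
    Click(8.5, 5, True),
]
times, errors = step_measures(trial)
```

In this made-up trial, step 4 contributes one error and no completion
time, and step 3 is timed but dropped from the time analysis.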

Results 

While  three  separate  experiments  were  run,  these 
experiments  differed  from  each  other  in  detail  only. 
Experiment  1  actually  included  subsequent  sessions  with  a 
variety  of  between-subjects  manipulations;  Experiment  2 
added  visual  cueing  at  the  postcompletion  step  of  the 
Phaser;  Experiment  3  used  a  cue  and  a  mode  indicator  to 
attempt to mitigate postcompletion effects in the Phaser at
step 11; the exact point system used in the three
experiments differed slightly; etc. However, none of these
surface dissimilarities made much difference; the results are
nearly  identical  for  all  three  experiments  (inclusion  of 
“experiment”  as  a  between-subjects  variable  reveals  no  main 
effects  or  interactions  involving  that  variable).  Across  the 
three  experiments,  data  from  a  total  of  164  participants  were 
used.  Figure  3  shows  the  results  for  step  completion  time 
for  the  Phaser  and  the  Transporter,  while  Figure  4  presents 
the  error  frequency.  Steps  3,  6,  and  10  are  excluded  from 
Figure  3  because  those  steps  involve  other  processes  (e.g., 
tracking)  or  possibly  multiple  actions,  as  described  above. 
Note  that  both  graphs  also  include  the  95%  confidence 
intervals  (non-pooled  error). 
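
One reading of "non-pooled error" is that each interval is computed
from the variance of its own task-by-step cell rather than from a
pooled error term. The sketch below illustrates that reading only. It
assumes a normal approximation for the critical value, and the function
name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def ci95(values):
    """Return (mean, half_width) of a 95% confidence interval.

    The standard error uses only this cell's own sample standard
    deviation (no pooling across cells). The critical value comes from
    the normal distribution, an approximation that is reasonable only
    for fairly large samples.
    """
    z = NormalDist().inv_cdf(0.975)  # roughly 1.96
    half_width = z * stdev(values) / sqrt(len(values))
    return mean(values), half_width


m, hw = ci95([1.0, 2.0, 3.0, 4.0, 5.0])
```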


So, while the tasks are isomorphic in terms of subgoals
and steps, they produce clearly different step completion
time profiles. This runs counter to any account
which relies entirely on GOMS-level structures.


Similarly,  if  one  assumes  that  the  same  failure  mechanisms 
are  in  operation  in  each  task,  the  two  tasks  should  produce 
identical  error  rate  profiles  as  well.  This  is  also  obviously 
not  the  case.  While  both  tasks  share  a  spike  in  error  rate  at 
step  4  in  the  procedure  (this  is,  in  fact,  a  postcompletion 
error),  the  Phaser  shows  other  spikes  at  steps  1  and  6  while 
the  Transporter  only  shows  another  spike  at  step  8.  Note 
that  these  spikes  in  error  rate  are  not  particularly  linked  to 
exceptionally  large  or  small  step  times,  either;  for  example, 
step  7  in  the  Phaser  is  particularly  slow,  but  is  not 
especially  error-prone.  Step  1  is  slow  in  both  tasks,  but 
only  markedly  error-prone  in  the  Phaser. 


Figure  4.  Mean  error  frequency  by  task  and  step 


Discussion 

These  data  are  obviously  problematic  for  any  account  which 
relies  solely  on  the  goal-subgoal-method  structure  for 
predicting execution time. It is hard to know how a
GOMS-style account might accommodate these data. Subjects
were probably not at the level of skill where extreme
interleaving of cognitive, perceptual, and motor operations
is required to model their performance, so it is not clear
that the CPM variant of GOMS (e.g., Gray, John, & Atwood,
1993) would be appropriate. This is not to say that motor
operators  are  unimportant;  there  are  some  differences  in 
going  from  button  to  button  in  terms  of  pointing  time  as 
predicted  by  Fitts’s  law,  but  these  differences  are  relatively 
small  (as  will  be  shown  later). 

One  possibility  that  seems  straightforward  is  that  each  of 
these  buttons  has  to  be  visually  located  in  order  for  the 
mouse  to  be  moved  to  the  button  and  a  click  registered. 
However,  there  is  no  single  “visual  search”  operator  in 
GOMS  (or  ACT-R  or  Soar  for  that  matter)  which  would 
obviously  capture  the  differences  here.  Each  button  on  the 
display  is  at  least  approximately  equal  in  terms  of  visual 
salience; while one might argue that the larger gray
pushbuttons are more salient and should thus be found
faster, there is little difference between steps 8 and 9 of the
Transporter, one of which is a large gray button and the
other simply a labeled checkbox. Furthermore, consider
that step 7 in the Phaser is markedly slower than step 7 in
the Transporter, and yet both are simple checkboxes with
two-word labels. So, if the difference is simply in a “visual
search”  operator,  this  operator  must  itself  be  driven  by 
something  substantially  more  sophisticated  than  what  is 
present  in  a  typical  GOMS  analysis.  Furthermore,  if  the 
only  difference  between  the  two  tasks  is  in  their  visual 
search  latencies,  the  source  of  the  differential  error  spikes 
remains  a  mystery. 

This raises the question of what kind of control
structure could account for the differences between these two
tasks. Accounting for the error profiles seems extremely
difficult with any model at this point; generative theories of
error are in their infancy at best (though such an account is
ultimately our goal; see also Byrne, 2003). Thus, we entered
into a modeling exploration with the modest goal of trying to
understand what drove the step completion times.

Modeling 

We constructed a number of models of this task using
ACT-R 5.0 (Anderson et al., in press). This was done not so
much because of a strong commitment to any particular
mechanism in ACT-R, but rather because ACT-R contains
the full suite of perceptual, motor, and cognitive
functionality required for these tasks. It is likely that some
version of Soar or EPIC would have served equally well for
present purposes, but we are much more familiar with
ACT-R (and further suspect we will need the subsymbolic
mechanisms for future error modeling).

We constructed four models of each task. It was our hope
that this way we might “bracket” performance (Kieras &
Meyer, 2000; Gray & Boehm-Davis, 2000) and see if the
models could provide reasonable predictive bounds. The
four models represented a crossing of two dichotomies:

Goal  organization.  The  first  dichotomy  was  whether  the 
model  used  a  hierarchical  representation  of  the  goal 
structure,  with  intermediate  subgoals  (e.g.,  “charge  the 
phaser”)  or  a  “flat”  goal  structure  where  12  low-level  goals 
were  simply  executed  in  sequence.  There  is  reason  to  believe 
that  even  well-practiced  experts  do  not  entirely  flatten  their 
goal  hierarchies  (Kieras,  Wood,  &  Meyer,  1997)  and  that,  in 
fact,  often  times  fairly  slow  retrieval-based  strategies  are 
appropriate  (Altmann  &  Trafton,  2002).  The  hierarchical 
goal  strategy  is  noted  with  “Hier”  in  the  model  label  and  the 
flat  with  “Flat.” 

Visual  search.  We  implemented  two  very  simplistic 
visual  search  strategies  here:  one  in  which  the  location  of 
each  button  had  to  be  determined  through  serial  visual 
examination  with  a  tendency  to  search  near  the  current  focus 
of  visual  attention  (Fleetwood  &  Byrne,  2003)  and  one  in 
which  the  model  is  assumed  to  have  declarative  knowledge 
of  the  locations  of  the  buttons  which  must  be  retrieved  for 
each button. Various ACT-R models (Ehret, 2002;
Anderson et al., in press) have shown that this kind of
learning is a key component of skill development in similar
interfaces. The unguided serial search strategy is noted with
“DS” (for dumb search) and the alternate with “RL” for
“retrieve location.”
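
To make the contrast between the two search strategies concrete, here
is a deliberately simplified sketch. It is not the ACT-R model code:
the "tendency to search near the current focus" is reduced to a strict
nearest-first rule, memory for visited locations never decays, and the
button names and coordinates in the usage example are invented for
illustration.

```python
from math import dist


def serial_search(start, target, buttons):
    """DS-style search: examine buttons nearest-first from the focus.

    buttons maps a label to its (x, y) screen location. Returns the
    labels in the order examined, ending with the target. Locations
    already examined are not revisited (simplifying assumption).
    """
    if target not in buttons:
        raise KeyError(target)
    focus, visited, order = start, set(), []
    while True:
        candidates = [b for b in buttons if b not in visited]
        nearest = min(candidates, key=lambda b: dist(focus, buttons[b]))
        order.append(nearest)
        if nearest == target:
            return order
        visited.add(nearest)
        focus = buttons[nearest]  # attention moves to the examined item


def retrieve_location(target, known_locations):
    """RL-style lookup: one memory retrieval, independent of layout."""
    return known_locations[target]


layout = {"Charge": (1, 0), "Firing": (2, 0), "Settings": (5, 0)}
examined = serial_search((0, 0), "Settings", layout)
```

Under these assumptions the serial search examines two distractors
before reaching the target, while the retrieval strategy costs the same
regardless of where the target sits. That is the sense in which only
the DS strategy is sensitive to the visual layout.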

It  should  be  noted  that  in  ACT-R,  these  dichotomies  may 
interact.  ACT-R’s  visual  system  has  a  memory  for  which 
locations  (though  not  explicitly  which  objects)  have  been 
viewed  recently,  but  this  memory  decays  over  time  (we  used 
1.5  seconds  for  this  decay  time;  the  models  are  indeed 
sensitive  to  this  parameter  but  in  unusual  ways  which  are 
beyond  the  scope  of  this  presentation).  Thus,  additional 
time  spent  in  traversing  the  goal  hierarchy  can  result  in  the 
loss  of  this  information,  which  may  affect  the  time  course 
of  the  serial  visual  search. 

ACT-R  also  embeds  Fitts’s  law  for  prediction  of  mouse 
movement  times.  We  used  ACT-R  to  calculate  the  expected 
movement  time  between  the  various  buttons  to  make  clear 
the  movement  time  contribution  to  the  results.  We  did  not 
compute  it  for  step  1  because  the  initial  location  of  the 
cursor  was  not  recorded;  informal  observation  of  the 
participants  indicated  that  many  of  them  moved  the  mouse 
around  before  clicking  anyway. 
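
For readers unfamiliar with the movement-time calculation, a minimal
sketch follows. It uses a form of Fitts's law commonly associated with
ACT-R's motor module, a log2(D/W + 0.5) index of difficulty with a
floor on movement time. The coefficient (0.1 s per bit) and the floor
(0.1 s) are assumed values, not parameters reported in this paper, and
the function name is hypothetical.

```python
from math import log2


def fitts_time(distance, width, coeff=0.1, min_time=0.1):
    """Predicted aimed-movement time in seconds.

    distance: distance from the cursor to the target center
    width:    target width along the line of movement (same units)
    coeff:    seconds per bit of difficulty (assumed value)
    min_time: lower bound on movement time (assumed value)
    """
    return max(min_time, coeff * log2(distance / width + 0.5))


far = fitts_time(350, 100)   # log2(4) = 2 bits of difficulty
near = fitts_time(10, 100)   # very easy movement, so the floor applies
```

This covers only the execution of the movement itself; any time for
preparing or initiating the movement would be added on top of it.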

Finally, these models are all stochastic. Time for memory
retrievals and perceptual-motor operations in ACT-R can be
made  noisy  and  ACT-R  chooses  randomly  between  options 
in  various  subsystems  in  cases  of  ties,  so  each  run  of  the 
model  is  not  identical  to  the  last.  We  present  the  mean 
model-generated  times  for  100  runs  of  each  model. 


Model  Results 

Figure  5  presents  the  data  and  the  model  predictions,  as 
well  as  the  Fitts’s  Law  time,  for  the  Phaser  task.  Figure  6 
presents  the  same  for  the  Transporter. 

Figure 5. Model and data for the Phaser task


None of the four models provides a particularly good
fit; which model is “best” depends on which metric of fit
one uses: by r-squared, the best model is Flat RL at 0.73;
by RMSD, the best model is Hier DS at 640 ms; by mean
absolute deviation (MAD), the best model is Hier RL at 26%.
These are fairly fine distinctions, since RMSD ranged from
640 to 739 ms and MAD ranged from 26% to 33%. Note that the
model variant here which is most similar to a GOMS-style
model is the Hier RL model. This model uses hierarchical goal
decomposition as per GOMS, and essentially has a fixed-time
“find-on-screen” operator (the retrieval of the location).
This model is generally good, if a bit too fast, for the
Transporter, but is a poor model for the Phaser.
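
The three fit metrics named above can be sketched as follows. The paper
does not spell out its formulas, so these are assumptions: r-squared is
taken as the squared Pearson correlation between model and data step
times, and MAD is taken as the mean absolute deviation expressed as a
percentage of the observed time. The function names and example values
are illustrative only.

```python
from math import sqrt


def r_squared(model, data):
    """Squared Pearson correlation between predictions and observations."""
    n = len(data)
    mean_m, mean_d = sum(model) / n, sum(data) / n
    cov = sum((m - mean_m) * (d - mean_d) for m, d in zip(model, data))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_d = sum((d - mean_d) ** 2 for d in data)
    return cov * cov / (var_m * var_d)


def rmsd(model, data):
    """Root mean squared deviation, in the units of the data (e.g. ms)."""
    return sqrt(sum((m - d) ** 2 for m, d in zip(model, data)) / len(data))


def mad_percent(model, data):
    """Mean absolute deviation as a percentage of each observed value."""
    return 100 * sum(abs(m - d) / d for m, d in zip(model, data)) / len(data)


observed = [1000, 2000, 3000]   # made-up step times in ms
predicted = [1100, 1900, 3300]
```

The metrics can disagree about which model is best because r-squared
rewards matching the pattern across steps even when predictions are
uniformly too fast or too slow, whereas RMSD and MAD penalize absolute
error.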


Figure  6.  Model  and  data  for  the  Transporter  task 

While the models do not provide outstanding levels of
fit, they do provide some important insights. First, Fitts’s
Law alone provides an r-squared of 0.28 for those steps
where it is applicable. Obviously time is grossly
underpredicted by Fitts’s Law; it is hardly surprising that
more is going on here than simple motor movement, though it
is clearly a contributor.

In general, the ordinal effects one would expect from
the basic construction of the models held: the Flat models
were faster than the Hier models and the DS models were
generally slower than the RL models. Note that in general,
the RL models’ performance on the two isomorphs was
quite similar. This is a reflection of their isomorphic task
structure. The DS models, on the other hand, reflect
differences between the tasks. This is consistent with the
notion that it is the visual aspects of the display that drive
the differences between the two tasks (the DS models are
sensitive to button location, while the RL models are
sensitive to it only in the Fitts’s law sense).

However, there were a few cases where the DS and RL
models were roughly equivalent, and even one case where the
DS models were faster (Phaser step 4, Power Connected).
There  were  some  degenerately  bad  performances  by  the  DS 
model,  notably  Phaser  step  9,  Tracking,  and  the  Hier  DS  on 
Transporter  step  5,  Enter  Frequency.  Both  of  these  cases 
involve  a  visual  shift  to  a  location  which  has  a  lot  of 
competition  from  items  much  closer  to  the  starting 
attentional  focus. 


On  the  other  hand,  the  slower  DS  resulted  in  better  fit 
in  some  instances,  namely  step  1  for  both  tasks,  and  steps  2 
and  3  for  the  Transporter.  Step  1  is  particularly  problematic 
for  all  four  models;  this  is  a  very  slow  step  for  both  the 
models  and  the  participants,  but  more  so  for  the 
participants.  We  suspect  this  is  due  to  some  kind  of  initial 
orienting  or  goal  construction  on  the  part  of  the  participants 
which  was  not  well-represented  in  the  model,  but  may  be 
partially  represented  by  the  DS  behavior  of  taking  an  initial 
visual  survey  of  the  display.  This  plays  into  the  next 
insight  we  gained  from  these  models. 

In general, the RL models were slightly better than the
DS models. What this suggests to us is that participants in
this  case  are  at  an  intermediate  point  in  their  learning  of  the 
locations  of  the  objects  on  the  interface.  Our  next  model 
will  likely  not  start  with  the  locations  explicitly  encoded  in 
declarative  memory  but  will  instead  use  the  strategy  of 
attempting  to  retrieve  them  from  memory,  but  this  time 
from  the  memories  created  as  a  by-product  of  visual  searches 
conducted  along  the  way. 

Comparisons between the Flat and Hier models are also
revealing. These models differed primarily at steps where
either they interacted with the visual search process
(Transporter step 5, Enter Frequency, is a good example of
this) or there was a delay for additional goal traversal (steps
1, 5, and 8). This additional time appears correct for both
tasks for steps 1 and 8, but step 5 indicates something else
is going on. Both Hier models are too slow for step 5 in the
Phaser, but the Hier RL model is right on target for step 5
for the Transporter.

Finally, some points were fairly strategy-insensitive.
Transporter step 7 (Accept Frequency) was fit equally well
by all four models. This is an interesting case for three
reasons: [1] the DS visual search strategy will almost
always search the correct location first here because of visual
proximity to where the model is looking prior to this step,
[2] it is the last subgoal within the second goal, and thus
not differentially affected by the goal organization, and [3]
the completion time for the corresponding Phaser step is
radically different. None of the models captured this deviant
time in the Phaser at all.

Discussion 

It may appear that our goal was to somehow
falsify or criticize GOMS models, but that was not our
intent. Instead, we wanted to explore where and why models
based  purely  on  the  GOMS-style  structure  would  misfit,  not 
for  the  purposes  of  finding  fault,  but  to  find  opportunities 
for  improvement.  One  of  the  primary  things  GOMS-style 
models  lack  is  a  consideration  of  the  visual  task  faced  by 
interface  users.  This  was  certainly  reasonable  when  most 
users  faced  command-line  tasks  which  were  indeed  primarily 
cognitive,  but  the  shift  to  increasingly  visual  interfaces  has 
raised  the  importance  of  systematically  addressing  the 
problem  of  how  the  visual  and  cognitive  systems  are 
integrated. While this has certainly been a big topic for
some cognitive scientists for many years (see Pylyshyn,
1999 and the associated commentary for an excellent
discussion), it has not been a prominent theme in
computational modeling of human-machine interfaces until
fairly recently, and then in cases where the task is clearly
defined as primarily a visual search task (e.g., Fleetwood &
Byrne, 2003; Everett & Byrne, 2004; Hornof & Kieras, 1997,
1999). Our research suggests this may be an important part
of routine procedure execution even when visual search may
not appear to be a dominant factor. Furthermore, it appears
that neither the most optimistic
assumption (users memorize the location of all controls) nor
the most pessimistic assumption (users search randomly
every time) is an appropriate representation of user
behavior, at least at this level of skill. This suggests that
more  research  is  needed  on  the  integration  of  cognitive 
mechanisms  such  as  representing  and  traversing  goal 
structures  with  visual-cognitive  mechanisms  such  as  search 
strategies.  While  we  doubt  anyone  would  have  denied  that 
this  is  an  important  domain  in  a  general  sense,  we  suspect 
that  most  researchers  in  this  area  would  underestimate  the 
impact  such  considerations  might  have  on  execution  of 
routine  procedures. 

To  end  on  a  speculative  note,  consider  the  hint 
provided  by  Phaser  step  5  (settings).  In  that  case,  the  Hier 
models  are  too  slow  and  the  Flat  models  are  spot-on, 
suggesting  that  the  goal  traversal  performed  by  the  model  is 
not  being  done  by  the  participants.  This  might  be  an 
indicator  that  the  participants  have  re-configured  their 
internal  representation  of  the  task  structure  to  match  the 
visual  structure  of  the  interface!  This  suggests  a  possibly 
important  role  for  the  match  between  the  task  structure  and 
the  visual  layout  of  an  interface,  something  clearly  not 
predicted  by  extant  GOMS-class  models. 

Appendix B. Chung & Byrne (accepted)


corresponding  human  response  will  facilitate  better  comprehension  of  the  mechanisms 


Despite  the  lack  of  reliable  differences  in  reaction  times  or  error  commission  at 


interaction,  more  likely  to  be  effective. 


general  findings  of  task  performance  degradation  under  situations  of  high  cognitive  load 
extreme end of this range. This problem has received surprisingly little attention from cognitive
psychologists.  The  research  summarized  here  examines  such  errors  in  some  detail  both  empirically  and 
through  computational  cognitive  modeling.  There  were  several  key  results.  First,  many  such  errors  are 
sensitive  not  just  to  the  structure  of  the  task  but  also  to  the  layout  of  controls  and  displays, 
contrary  to  the  predictions  of  most  current  task  analysis  frameworks.  Some  such  errors  seem  to  be 
mitigable  by  simple  layout  changes.  Second,  a  particularly  pervasive  error  (termed  postcompletion 
error)  was  found  to  be  highly  resistant  to  cue-based  mitigation,  and  though  an  effective  cue  was  found, 
the  requirements  for  such  cues  are  difficult  to  meet  in  field  contexts.  Finally,  cognitive 
computational  models  constructed  using  the  ACT-R  cognitive  architecture  suggested  that  certain 
interface  manipulations  (removing  state  information,  adding  additional  extraneous  controls)  which 
appeared  major  would  actually  have  limited  impact  on  human  task  performance,  and  these  predictions  were 
validated  empirically. 